{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "toc-hr-collapsed": false
   },
   "source": [
    "# Sorting Instagram posts for manual annotation"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "From the Instagram posts scraped using [Instaphyte](https://github.com/ScriptSmith/instaphyte/tree/master/instaphyte), for each site (Rollright Stones and Rufford Abbey):\n",
    "1. Collate unique posts from hashtag and location ID.\n",
    "2. Restrict posts to 1 May 2014&ndash;30 April 2019 (inclusive).\n",
    "3. Randomly sample 3,979 posts for Rufford Abbey to match the number of posts for Rollright Stones.\n",
    "4. From the sample, randomly select 1,000 posts to be annotated.\n",
    "5. Distribute the selected 1,000 posts among four coders.\n",
    "6. Create new directories containing:\n",
    "    * ``sample`` images (3,979 from each site);\n",
    "    * images to be ``annot``ated (1,000 from each site);\n",
    "    * images to be annotated by each ``coder`` (250 from each site)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 170,
   "metadata": {},
   "outputs": [],
   "source": [
    "%matplotlib inline\n",
    "import glob\n",
    "import numpy as np\n",
    "import os\n",
    "import pandas as pd\n",
    "import pickle\n",
    "import pytz\n",
    "import random\n",
    "import seaborn as sns\n",
    "import shutil\n",
    "import time\n",
    "\n",
    "from datetime import datetime\n",
    "from matplotlib import pyplot as plt\n",
    "\n",
    "CWD = os.getcwd()\n",
    "# print(CWD)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "toc-hr-collapsed": true
   },
   "source": [
    "## Collate unique posts from hashtag and location IDs"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Parse ``.csv`` files of scraped data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Load each site's hashtag and location ID, and paths to corresponding directories:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 71,
   "metadata": {},
   "outputs": [],
   "source": [
    "ROLLRIGHT_TAGS = ['rollrightstones','1244762']\n",
    "RUFFORD_TAGS = ['ruffordabbey','276342562']\n",
    "DIR_PATHS = {}\n",
    "DIR_PATHS['csv'] = os.path.join(CWD,'Instaphyte')\n",
    "DIR_PATHS['rollright'] = os.path.join(CWD,'Instaphyte','RollrightStones')\n",
    "DIR_PATHS['rufford'] = os.path.join(CWD,'Instaphyte','RuffordAbbey')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The function ``tagsDF(tags,dir_path)`` reads each ``.csv`` file (generated by Instaphyte and saved in ``dir_path``) corresponding to a scraped hashtag or location ID (listed in ``tags``) into a ``pandas`` ``DataFrame``. Its rows are posts and its columns are each post's unique ID (https://www.instagram.com/p/{postid}/), Unix timestamp (which I convert into a timezone-aware ``datetime`` object), caption, user ID, number of likes, image height, and image width.\n",
    "\n",
    "On the initial run, the ``.csv`` entries for the following posts (identified by their unique IDs) were misaligned: their captions overflowed onto more than one row in the file, possibly because a newline character (\"\\n\") preceded a hashtag or followed an emoji or special character.\n",
    "- ``ruffordabbey.csv``: BfeMOtAA3GK, BO6bZ7MAT4t, _1XTO8upge\n",
    "\n",
    "These rows raised ``TypeError``s when deriving ``time`` from the ``unix`` timestamp. I therefore saved corrections to an ``.xlsx`` file of the same name, whose ``code`` (post ID) column must be formatted as text so that IDs with a leading dash are not interpreted as formulas (``#NAME?`` errors)."
   ]
  },
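  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a minimal sketch of the timestamp conversion described above (not the notebook's own code), the next cell converts one Unix timestamp into a timezone-aware ``datetime``. It assumes Python 3.9+ with time zone data available and uses the standard-library ``zoneinfo`` as a stand-in for ``pytz``; ``to_london_time`` is a hypothetical helper name. The sample value is one entry from the ``unix`` column of the scraped Rollright Stones data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from datetime import datetime\n",
    "from zoneinfo import ZoneInfo  # assumption: Python 3.9+; stands in for pytz\n",
    "\n",
    "LONDON = ZoneInfo('Europe/London')\n",
    "\n",
    "def to_london_time(unix_seconds):\n",
    "    # Hypothetical helper: Unix seconds -> timezone-aware datetime in London time\n",
    "    return datetime.fromtimestamp(unix_seconds, tz=LONDON)\n",
    "\n",
    "example_time = to_london_time(1557245116)\n",
    "print(example_time.isoformat())"
   ]
  },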
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [],
   "source": [
    "CSVCOLS = ['node.shortcode','node.taken_at_timestamp',\n",
    "           'node.edge_media_to_caption.edges.0.node.text',\n",
    "           'node.owner.id', 'node.edge_liked_by.count',\n",
    "           'node.dimensions.height','node.dimensions.width']\n",
    "RENAMEDCOLS = ['code','unix','caption','ownerid','likes','height','width']\n",
    "TIMEZONE = pytz.timezone(\"Europe/London\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {},
   "outputs": [],
   "source": [
    "def tagsDF(tags,dir_path):\n",
    "    \"\"\"\n",
    "    Input: tags(list)-site hashtag/s and location ID/s\n",
    "           dir_path(str)-path to directory containing .csv\n",
    "    Returns: DF with metadata on the site's scraped posts\n",
    "    \"\"\"\n",
    "    frames = []\n",
    "    for tag in tags:\n",
    "        if tag=='ruffordabbey': # Use the corrected .xlsx file\n",
    "            csv_df = pd.read_excel(os.path.join(dir_path,\"{}.xlsx\".format(tag)))[CSVCOLS]\n",
    "        else:\n",
    "            csv_df = pd.read_csv(os.path.join(dir_path,\"{}.csv\".format(tag)),usecols=CSVCOLS)[CSVCOLS]\n",
    "        csv_df.columns = RENAMEDCOLS\n",
    "        # Interpret the Unix timestamp directly in the target timezone\n",
    "        csv_df['time'] = csv_df.apply(lambda row: datetime.fromtimestamp(row['unix'],tz=TIMEZONE),axis=1)\n",
    "        print(\"{} posts for {}\".format(len(csv_df),tag))\n",
    "        frames.append(csv_df)\n",
    "    # DataFrame.append was removed in pandas 2.0; concatenate once instead\n",
    "    return pd.concat(frames) if frames else pd.DataFrame()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 100,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "2153 posts for rollrightstones\n",
      "2864 posts for 1244762\n",
      "5235 posts for ruffordabbey\n",
      "3791 posts for 276342562\n"
     ]
    }
   ],
   "source": [
    "rollright_meta = tagsDF(ROLLRIGHT_TAGS,DIR_PATHS['csv'])\n",
    "rufford_meta = tagsDF(RUFFORD_TAGS,DIR_PATHS['csv'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 101,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "5017 posts for Rollright Stones\n",
      "9026 posts for Rufford Abbey\n"
     ]
    }
   ],
   "source": [
    "print(\"{} posts for Rollright Stones\".format(len(rollright_meta)))\n",
    "print(\"{} posts for Rufford Abbey\".format(len(rufford_meta)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Remove duplicate posts (with same ``code``)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 102,
   "metadata": {},
   "outputs": [],
   "source": [
    "def removeDuplicates(posts_df):\n",
    "    \"\"\"\n",
    "    Input: posts_df(DataFrame)-posts with unique IDs in column 'code'\n",
    "    Returns: DF with 'code' as index column and duplicate codes removed\n",
    "    \"\"\"\n",
    "    posts_df = posts_df.set_index('code')\n",
    "    return posts_df.loc[~posts_df.index.duplicated(keep='first')]"
   ]
  },
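  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To illustrate on toy data, assuming only ``pandas``: when the same post is scraped under both a hashtag and a location ID, ``Index.duplicated(keep='first')`` flags the later occurrence, so only the first row for each ``code`` is kept. The post IDs and values below are made up."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "# Made-up posts: 'abc' appears under both a hashtag and a location ID\n",
    "toy_posts = pd.DataFrame({'code': ['abc', 'xyz', 'abc'],\n",
    "                          'likes': [5, 3, 5]})\n",
    "toy_indexed = toy_posts.set_index('code')\n",
    "# Boolean mask that is True for every repeat of an earlier index value\n",
    "is_repeat = toy_indexed.index.duplicated(keep='first')\n",
    "toy_unique = toy_indexed.loc[~is_repeat]\n",
    "print(is_repeat.tolist(), list(toy_unique.index))"
   ]
  },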
  {
   "cell_type": "code",
   "execution_count": 103,
   "metadata": {},
   "outputs": [],
   "source": [
    "rollright_meta = removeDuplicates(rollright_meta)\n",
    "rufford_meta = removeDuplicates(rufford_meta)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 104,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "4132 posts for Rollright Stones\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>unix</th>\n",
       "      <th>caption</th>\n",
       "      <th>ownerid</th>\n",
       "      <th>likes</th>\n",
       "      <th>height</th>\n",
       "      <th>width</th>\n",
       "      <th>time</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>code</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>BxKtlNug--7</th>\n",
       "      <td>1557245116</td>\n",
       "      <td>Rollright Stones.#rollrightstones #neolithic</td>\n",
       "      <td>2013399500</td>\n",
       "      <td>5</td>\n",
       "      <td>810</td>\n",
       "      <td>1080</td>\n",
       "      <td>2019-05-07 17:05:16+01:00</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>BxIoj3ZnZ_y</th>\n",
       "      <td>1557175374</td>\n",
       "      <td>#thekingstone without #instagram cutting the t...</td>\n",
       "      <td>48772936</td>\n",
       "      <td>3</td>\n",
       "      <td>1350</td>\n",
       "      <td>1080</td>\n",
       "      <td>2019-05-06 21:42:54+01:00</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>BxIocUrHpBm</th>\n",
       "      <td>1557175312</td>\n",
       "      <td>Thought this one looked like it had a piglet s...</td>\n",
       "      <td>48772936</td>\n",
       "      <td>6</td>\n",
       "      <td>1080</td>\n",
       "      <td>1080</td>\n",
       "      <td>2019-05-06 21:41:52+01:00</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>BxIoES-gi7q</th>\n",
       "      <td>1557175116</td>\n",
       "      <td>Is #groot in there? #rollrightstones</td>\n",
       "      <td>4726006031</td>\n",
       "      <td>4</td>\n",
       "      <td>1080</td>\n",
       "      <td>1080</td>\n",
       "      <td>2019-05-06 21:38:36+01:00</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>BxIEUREHC8M</th>\n",
       "      <td>1557156372</td>\n",
       "      <td>Our countryside is lovely #rollrightstones</td>\n",
       "      <td>48772936</td>\n",
       "      <td>7</td>\n",
       "      <td>1080</td>\n",
       "      <td>1080</td>\n",
       "      <td>2019-05-06 16:26:12+01:00</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                   unix                                            caption  \\\n",
       "code                                                                         \n",
       "BxKtlNug--7  1557245116       Rollright Stones.#rollrightstones #neolithic   \n",
       "BxIoj3ZnZ_y  1557175374  #thekingstone without #instagram cutting the t...   \n",
       "BxIocUrHpBm  1557175312  Thought this one looked like it had a piglet s...   \n",
       "BxIoES-gi7q  1557175116               Is #groot in there? #rollrightstones   \n",
       "BxIEUREHC8M  1557156372         Our countryside is lovely #rollrightstones   \n",
       "\n",
       "                ownerid  likes  height  width                      time  \n",
       "code                                                                     \n",
       "BxKtlNug--7  2013399500      5     810   1080 2019-05-07 17:05:16+01:00  \n",
       "BxIoj3ZnZ_y    48772936      3    1350   1080 2019-05-06 21:42:54+01:00  \n",
       "BxIocUrHpBm    48772936      6    1080   1080 2019-05-06 21:41:52+01:00  \n",
       "BxIoES-gi7q  4726006031      4    1080   1080 2019-05-06 21:38:36+01:00  \n",
       "BxIEUREHC8M    48772936      7    1080   1080 2019-05-06 16:26:12+01:00  "
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "8064 posts for Rufford Abbey\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>unix</th>\n",
       "      <th>caption</th>\n",
       "      <th>ownerid</th>\n",
       "      <th>likes</th>\n",
       "      <th>height</th>\n",
       "      <th>width</th>\n",
       "      <th>time</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>code</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>BxLVuryh1-o</th>\n",
       "      <td>1557266165</td>\n",
       "      <td>#daysout #friends #walks #ducks #nofilterneede...</td>\n",
       "      <td>8724751</td>\n",
       "      <td>4</td>\n",
       "      <td>1080</td>\n",
       "      <td>1080</td>\n",
       "      <td>2019-05-07 22:56:05+01:00</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>BxKJSTPBBiz</th>\n",
       "      <td>1557226086</td>\n",
       "      <td>ðŸƒâœ¨Living her best lifeðŸŒˆðŸ„ #ruffordab...</td>\n",
       "      <td>52647925</td>\n",
       "      <td>15</td>\n",
       "      <td>1080</td>\n",
       "      <td>1080</td>\n",
       "      <td>2019-05-07 11:48:06+01:00</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>BxI1RJ7hxBB</th>\n",
       "      <td>1557182037</td>\n",
       "      <td>A lovely family day at Rufford Abbey with the ...</td>\n",
       "      <td>9435530728</td>\n",
       "      <td>28</td>\n",
       "      <td>1080</td>\n",
       "      <td>1080</td>\n",
       "      <td>2019-05-06 23:33:57+01:00</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>BxIvKBDjblw</th>\n",
       "      <td>1557178832</td>\n",
       "      <td>Weâ€™ve had a great weekend in the caravan, wa...</td>\n",
       "      <td>5694632845</td>\n",
       "      <td>20</td>\n",
       "      <td>1350</td>\n",
       "      <td>1080</td>\n",
       "      <td>2019-05-06 22:40:32+01:00</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>BxIkQNRBaWo</th>\n",
       "      <td>1557173116</td>\n",
       "      <td>#ruffordabbey ðŸƒðŸ¦†ðŸŒ³</td>\n",
       "      <td>1951606466</td>\n",
       "      <td>12</td>\n",
       "      <td>937</td>\n",
       "      <td>750</td>\n",
       "      <td>2019-05-06 21:05:16+01:00</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                   unix                                            caption  \\\n",
       "code                                                                         \n",
       "BxLVuryh1-o  1557266165  #daysout #friends #walks #ducks #nofilterneede...   \n",
       "BxKJSTPBBiz  1557226086  ðŸƒâœ¨Living her best lifeðŸŒˆðŸ„ #ruffordab...   \n",
       "BxI1RJ7hxBB  1557182037  A lovely family day at Rufford Abbey with the ...   \n",
       "BxIvKBDjblw  1557178832  Weâ€™ve had a great weekend in the caravan, wa...   \n",
       "BxIkQNRBaWo  1557173116                         #ruffordabbey ðŸƒðŸ¦†ðŸŒ³   \n",
       "\n",
       "                ownerid  likes  height  width                      time  \n",
       "code                                                                     \n",
       "BxLVuryh1-o     8724751      4    1080   1080 2019-05-07 22:56:05+01:00  \n",
       "BxKJSTPBBiz    52647925     15    1080   1080 2019-05-07 11:48:06+01:00  \n",
       "BxI1RJ7hxBB  9435530728     28    1080   1080 2019-05-06 23:33:57+01:00  \n",
       "BxIvKBDjblw  5694632845     20    1350   1080 2019-05-06 22:40:32+01:00  \n",
       "BxIkQNRBaWo  1951606466     12     937    750 2019-05-06 21:05:16+01:00  "
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "print(\"{} posts for Rollright Stones\".format(len(rollright_meta)))\n",
    "display(rollright_meta.head())\n",
    "print(\"{} posts for Rufford Abbey\".format(len(rufford_meta)))\n",
    "display(rufford_meta.head())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Read directories of scraped images"
   ]
  },
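  {
   "cell_type": "code",
   "execution_count": 105,
   "metadata": {},
   "outputs": [],
   "source": [
    "def imagesDF(tags,dir_path):\n",
    "    \"\"\"\n",
    "    Input: tags(list)-site hashtag/s and location ID/s (names of image subdirectories)\n",
    "           dir_path(str)-path to directory containing the subdirectories\n",
    "    Returns: DF with each image's file path and post ID ('code')\n",
    "    \"\"\"\n",
    "    image_paths = []\n",
    "    for tag in tags:\n",
    "        image_paths += sorted(glob.glob(os.path.join(dir_path,tag,'*.jpg')))\n",
    "    images_df = pd.DataFrame(image_paths, columns=['path'])\n",
    "    # Post ID is the file name without its extension\n",
    "    images_df['code'] = images_df['path'].apply(lambda p: os.path.splitext(os.path.basename(p))[0])\n",
    "    return images_df"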
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 106,
   "metadata": {},
   "outputs": [],
   "source": [
    "rollright_path = imagesDF(ROLLRIGHT_TAGS,DIR_PATHS['rollright'])\n",
    "rufford_path = imagesDF(RUFFORD_TAGS,DIR_PATHS['rufford'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 107,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "5017 images for Rollright Stones\n",
      "9023 images for Rufford Abbey\n"
     ]
    }
   ],
   "source": [
    "print(\"{} images for Rollright Stones\".format(len(rollright_path)))\n",
    "print(\"{} images for Rufford Abbey\".format(len(rufford_path)))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 108,
   "metadata": {},
   "outputs": [],
   "source": [
    "rollright_path = removeDuplicates(rollright_path)\n",
    "rufford_path = removeDuplicates(rufford_path)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 109,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "4132 images for Rollright Stones\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>path</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>code</th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>-9fr-UQULu</th>\n",
       "      <td>C:\\Users\\tania\\Documents\\SDS\\Thesis\\Pipeline\\A...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>-GiikRp01G</th>\n",
       "      <td>C:\\Users\\tania\\Documents\\SDS\\Thesis\\Pipeline\\A...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>-WLVcDQ2_f</th>\n",
       "      <td>C:\\Users\\tania\\Documents\\SDS\\Thesis\\Pipeline\\A...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>-WLiDsQ2_x</th>\n",
       "      <td>C:\\Users\\tania\\Documents\\SDS\\Thesis\\Pipeline\\A...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>-hvFNemG61</th>\n",
       "      <td>C:\\Users\\tania\\Documents\\SDS\\Thesis\\Pipeline\\A...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                                         path\n",
       "code                                                         \n",
       "-9fr-UQULu  C:\\Users\\tania\\Documents\\SDS\\Thesis\\Pipeline\\A...\n",
       "-GiikRp01G  C:\\Users\\tania\\Documents\\SDS\\Thesis\\Pipeline\\A...\n",
       "-WLVcDQ2_f  C:\\Users\\tania\\Documents\\SDS\\Thesis\\Pipeline\\A...\n",
       "-WLiDsQ2_x  C:\\Users\\tania\\Documents\\SDS\\Thesis\\Pipeline\\A...\n",
       "-hvFNemG61  C:\\Users\\tania\\Documents\\SDS\\Thesis\\Pipeline\\A..."
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "8061 images for Rufford Abbey\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>path</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>code</th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>-HZmCBuRXf</th>\n",
       "      <td>C:\\Users\\tania\\Documents\\SDS\\Thesis\\Pipeline\\A...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>-HZtbZuRXz</th>\n",
       "      <td>C:\\Users\\tania\\Documents\\SDS\\Thesis\\Pipeline\\A...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>-IzjJ4TWbk</th>\n",
       "      <td>C:\\Users\\tania\\Documents\\SDS\\Thesis\\Pipeline\\A...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>-Jsey7IAld</th>\n",
       "      <td>C:\\Users\\tania\\Documents\\SDS\\Thesis\\Pipeline\\A...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>-KM4MySRbI</th>\n",
       "      <td>C:\\Users\\tania\\Documents\\SDS\\Thesis\\Pipeline\\A...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                                         path\n",
       "code                                                         \n",
       "-HZmCBuRXf  C:\\Users\\tania\\Documents\\SDS\\Thesis\\Pipeline\\A...\n",
       "-HZtbZuRXz  C:\\Users\\tania\\Documents\\SDS\\Thesis\\Pipeline\\A...\n",
       "-IzjJ4TWbk  C:\\Users\\tania\\Documents\\SDS\\Thesis\\Pipeline\\A...\n",
       "-Jsey7IAld  C:\\Users\\tania\\Documents\\SDS\\Thesis\\Pipeline\\A...\n",
       "-KM4MySRbI  C:\\Users\\tania\\Documents\\SDS\\Thesis\\Pipeline\\A..."
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "print(\"{} images for Rollright Stones\".format(len(rollright_path)))\n",
    "display(rollright_path.head())\n",
    "print(\"{} images for Rufford Abbey\".format(len(rufford_path)))\n",
    "display(rufford_path.head())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Merge DataFrames of image metadata and file paths"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 118,
   "metadata": {},
   "outputs": [],
   "source": [
    "rollright_df = rollright_meta.merge(rollright_path,left_index=True,right_index=True,how='inner')\n",
    "rufford_df = rufford_meta.merge(rufford_path,left_index=True,right_index=True,how='inner')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Check that the number of common codes is as expected:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 124,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "True\n",
      "True\n"
     ]
    }
   ],
   "source": [
    "print(len(list(set(rollright_meta.index) & set(rollright_path.index))) == len(rollright_df))\n",
    "print(len(list(set(rufford_meta.index) & set(rufford_path.index))) == len(rufford_df))"
   ]
  },
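  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The same check can be sketched on toy data, assuming only ``pandas``: with unique indices, an inner merge on the index keeps exactly the codes present in both DataFrames, so its length equals the size of the intersection of the two index sets. All names and values below are made up for illustration."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "# Made-up metadata and file paths that share only the codes 'b' and 'c'\n",
    "toy_meta = pd.DataFrame({'likes': [5, 3, 7]},\n",
    "                        index=pd.Index(['a', 'b', 'c'], name='code'))\n",
    "toy_path = pd.DataFrame({'path': ['b.jpg', 'c.jpg', 'd.jpg']},\n",
    "                        index=pd.Index(['b', 'c', 'd'], name='code'))\n",
    "# Inner merge on the index keeps only codes found in both DataFrames\n",
    "toy_merged = toy_meta.merge(toy_path, left_index=True, right_index=True, how='inner')\n",
    "common_codes = set(toy_meta.index) & set(toy_path.index)\n",
    "print(len(toy_merged) == len(common_codes))"
   ]
  },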
  {
   "cell_type": "code",
   "execution_count": 125,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "4132 images for Rollright Stones\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>unix</th>\n",
       "      <th>caption</th>\n",
       "      <th>ownerid</th>\n",
       "      <th>likes</th>\n",
       "      <th>height</th>\n",
       "      <th>width</th>\n",
       "      <th>time</th>\n",
       "      <th>path</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>code</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>Hn1t7</th>\n",
       "      <td>1310631805</td>\n",
       "      <td>Roll right stones. HDR pro app</td>\n",
       "      <td>630556</td>\n",
       "      <td>63</td>\n",
       "      <td>612</td>\n",
       "      <td>612</td>\n",
       "      <td>2011-07-14 09:23:25+01:00</td>\n",
       "      <td>C:\\Users\\tania\\Documents\\SDS\\Thesis\\Pipeline\\A...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Hn15a</th>\n",
       "      <td>1310631884</td>\n",
       "      <td>Rollright stones. HDR Pro app</td>\n",
       "      <td>630556</td>\n",
       "      <td>94</td>\n",
       "      <td>612</td>\n",
       "      <td>612</td>\n",
       "      <td>2011-07-14 09:24:44+01:00</td>\n",
       "      <td>C:\\Users\\tania\\Documents\\SDS\\Thesis\\Pipeline\\A...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>QAGiL</th>\n",
       "      <td>1318507690</td>\n",
       "      <td>Been to visit Rollright stones. #mystical</td>\n",
       "      <td>3642067</td>\n",
       "      <td>7</td>\n",
       "      <td>612</td>\n",
       "      <td>612</td>\n",
       "      <td>2011-10-13 13:08:10+01:00</td>\n",
       "      <td>C:\\Users\\tania\\Documents\\SDS\\Thesis\\Pipeline\\A...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>QAG5p</th>\n",
       "      <td>1318507784</td>\n",
       "      <td>Visiting Rollright stones.#mystical</td>\n",
       "      <td>3642067</td>\n",
       "      <td>29</td>\n",
       "      <td>612</td>\n",
       "      <td>612</td>\n",
       "      <td>2011-10-13 13:09:44+01:00</td>\n",
       "      <td>C:\\Users\\tania\\Documents\\SDS\\Thesis\\Pipeline\\A...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>QAHIU</th>\n",
       "      <td>1318507843</td>\n",
       "      <td>Kate and Jazz at Rollright stones.#mystical</td>\n",
       "      <td>3642067</td>\n",
       "      <td>23</td>\n",
       "      <td>612</td>\n",
       "      <td>612</td>\n",
       "      <td>2011-10-13 13:10:43+01:00</td>\n",
       "      <td>C:\\Users\\tania\\Documents\\SDS\\Thesis\\Pipeline\\A...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "             unix                                       caption  ownerid  \\\n",
       "code                                                                       \n",
       "Hn1t7  1310631805                Roll right stones. HDR pro app   630556   \n",
       "Hn15a  1310631884                 Rollright stones. HDR Pro app   630556   \n",
       "QAGiL  1318507690     Been to visit Rollright stones. #mystical  3642067   \n",
       "QAG5p  1318507784          Visiting Rollright stones.#mystical   3642067   \n",
       "QAHIU  1318507843  Kate and Jazz at Rollright stones.#mystical   3642067   \n",
       "\n",
       "       likes  height  width                      time  \\\n",
       "code                                                    \n",
       "Hn1t7     63     612    612 2011-07-14 09:23:25+01:00   \n",
       "Hn15a     94     612    612 2011-07-14 09:24:44+01:00   \n",
       "QAGiL      7     612    612 2011-10-13 13:08:10+01:00   \n",
       "QAG5p     29     612    612 2011-10-13 13:09:44+01:00   \n",
       "QAHIU     23     612    612 2011-10-13 13:10:43+01:00   \n",
       "\n",
       "                                                    path  \n",
       "code                                                      \n",
       "Hn1t7  C:\\Users\\tania\\Documents\\SDS\\Thesis\\Pipeline\\A...  \n",
       "Hn15a  C:\\Users\\tania\\Documents\\SDS\\Thesis\\Pipeline\\A...  \n",
       "QAGiL  C:\\Users\\tania\\Documents\\SDS\\Thesis\\Pipeline\\A...  \n",
       "QAG5p  C:\\Users\\tania\\Documents\\SDS\\Thesis\\Pipeline\\A...  \n",
       "QAHIU  C:\\Users\\tania\\Documents\\SDS\\Thesis\\Pipeline\\A...  "
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "8061 images for Rufford Abbey\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>unix</th>\n",
       "      <th>caption</th>\n",
       "      <th>ownerid</th>\n",
       "      <th>likes</th>\n",
       "      <th>height</th>\n",
       "      <th>width</th>\n",
       "      <th>time</th>\n",
       "      <th>path</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>code</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>Pgi2o</th>\n",
       "      <td>1318105182</td>\n",
       "      <td>#bw #blackandwhite</td>\n",
       "      <td>8999729</td>\n",
       "      <td>11</td>\n",
       "      <td>612</td>\n",
       "      <td>612</td>\n",
       "      <td>2011-10-08 21:19:42+01:00</td>\n",
       "      <td>C:\\Users\\tania\\Documents\\SDS\\Thesis\\Pipeline\\A...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Pgjj1</th>\n",
       "      <td>1318105278</td>\n",
       "      <td>#ruffordabbey</td>\n",
       "      <td>8999729</td>\n",
       "      <td>14</td>\n",
       "      <td>612</td>\n",
       "      <td>612</td>\n",
       "      <td>2011-10-08 21:21:18+01:00</td>\n",
       "      <td>C:\\Users\\tania\\Documents\\SDS\\Thesis\\Pipeline\\A...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Pgk73</th>\n",
       "      <td>1318105472</td>\n",
       "      <td>#ruffordabbey</td>\n",
       "      <td>8999729</td>\n",
       "      <td>5</td>\n",
       "      <td>612</td>\n",
       "      <td>612</td>\n",
       "      <td>2011-10-08 21:24:32+01:00</td>\n",
       "      <td>C:\\Users\\tania\\Documents\\SDS\\Thesis\\Pipeline\\A...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Pg38l</th>\n",
       "      <td>1318108129</td>\n",
       "      <td>#RuffordAbbey</td>\n",
       "      <td>8999729</td>\n",
       "      <td>6</td>\n",
       "      <td>612</td>\n",
       "      <td>612</td>\n",
       "      <td>2011-10-08 22:08:49+01:00</td>\n",
       "      <td>C:\\Users\\tania\\Documents\\SDS\\Thesis\\Pipeline\\A...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Pk4IX</th>\n",
       "      <td>1318147677</td>\n",
       "      <td>#ruffordabbey</td>\n",
       "      <td>8999729</td>\n",
       "      <td>4</td>\n",
       "      <td>612</td>\n",
       "      <td>612</td>\n",
       "      <td>2011-10-09 09:07:57+01:00</td>\n",
       "      <td>C:\\Users\\tania\\Documents\\SDS\\Thesis\\Pipeline\\A...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "             unix             caption  ownerid  likes  height  width  \\\n",
       "code                                                                   \n",
       "Pgi2o  1318105182  #bw #blackandwhite  8999729     11     612    612   \n",
       "Pgjj1  1318105278       #ruffordabbey  8999729     14     612    612   \n",
       "Pgk73  1318105472       #ruffordabbey  8999729      5     612    612   \n",
       "Pg38l  1318108129       #RuffordAbbey  8999729      6     612    612   \n",
       "Pk4IX  1318147677       #ruffordabbey  8999729      4     612    612   \n",
       "\n",
       "                           time  \\\n",
       "code                              \n",
       "Pgi2o 2011-10-08 21:19:42+01:00   \n",
       "Pgjj1 2011-10-08 21:21:18+01:00   \n",
       "Pgk73 2011-10-08 21:24:32+01:00   \n",
       "Pg38l 2011-10-08 22:08:49+01:00   \n",
       "Pk4IX 2011-10-09 09:07:57+01:00   \n",
       "\n",
       "                                                    path  \n",
       "code                                                      \n",
       "Pgi2o  C:\\Users\\tania\\Documents\\SDS\\Thesis\\Pipeline\\A...  \n",
       "Pgjj1  C:\\Users\\tania\\Documents\\SDS\\Thesis\\Pipeline\\A...  \n",
       "Pgk73  C:\\Users\\tania\\Documents\\SDS\\Thesis\\Pipeline\\A...  \n",
       "Pg38l  C:\\Users\\tania\\Documents\\SDS\\Thesis\\Pipeline\\A...  \n",
       "Pk4IX  C:\\Users\\tania\\Documents\\SDS\\Thesis\\Pipeline\\A...  "
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "print(\"{} images for Rollright Stones\".format(len(rollright_df)))\n",
    "display(rollright_df.sort_values('time').head())\n",
    "print(\"{} images for Rufford Abbey\".format(len(rufford_df)))\n",
    "display(rufford_df.sort_values('time').head())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "_Note_: 3 posts listed in ``rufford_meta`` have no image in ``rufford_path``; they could not be downloaded from Instagram because their ``.jpg`` files are not available."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 153,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['BlaTpr7B0mz', 'BjYfVOPgBZu', 'BZJtHSqHm6G']\n"
     ]
    }
   ],
   "source": [
    "# Post codes listed in the metadata that have no downloaded image\n",
    "print(rufford_meta.index[~rufford_meta.index.isin(rufford_path.index)].tolist())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 134,
   "metadata": {},
   "outputs": [],
   "source": [
    "with open('rollright_df.pickle', 'wb') as f:\n",
    "    pickle.dump(rollright_df,f,pickle.HIGHEST_PROTOCOL)\n",
    "with open('rufford_df.pickle', 'wb') as f:\n",
    "    pickle.dump(rufford_df,f,pickle.HIGHEST_PROTOCOL)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Restrict posts to 1 May 2014&ndash;30 April 2019 (inclusive)"
   ]
  },
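  {
   "cell_type": "code",
   "execution_count": 126,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Attach TIMEZONE to the naive bounds directly (assumes TIMEZONE is a pytz timezone);\n",
    "# astimezone() on a naive datetime would first interpret it in the machine's local zone.\n",
    "STARTDATE = TIMEZONE.localize(datetime(2014,5,1,0,0,0))\n",
    "ENDDATE = TIMEZONE.localize(datetime(2019,4,30,23,59,59))"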
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 127,
   "metadata": {},
   "outputs": [],
   "source": [
    "rollright5_df = rollright_df.loc[(rollright_df['time']>=STARTDATE) & (rollright_df['time']<=ENDDATE)]\n",
    "rufford5_df = rufford_df.loc[(rufford_df['time']>=STARTDATE) & (rufford_df['time']<=ENDDATE)]"
   ]
  },
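  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The inclusive date window can be checked on toy data. The sketch below is not part of the pipeline: the frame ``toy``, its four timestamps, and the hard-coded ``Europe/London`` zone are assumptions made for illustration, and it builds the bounds with ``pd.Timestamp(..., tz=...)`` rather than with the notebook's own ``TIMEZONE`` object. The timestamps sit one second either side of each bound, so the output shows which edge cases are kept.\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "# Assumed zone, for illustration only.\n",
    "ZONE = 'Europe/London'\n",
    "start = pd.Timestamp('2014-05-01 00:00:00', tz=ZONE)\n",
    "end = pd.Timestamp('2019-04-30 23:59:59', tz=ZONE)\n",
    "\n",
    "# Hypothetical posts: one just before, two on the bounds, one just after.\n",
    "toy = pd.DataFrame(\n",
    "    {'unix': [1398898799, 1398898800, 1556665199, 1556665200]},\n",
    "    index=['before', 'first', 'last', 'after'],\n",
    ")\n",
    "toy['time'] = pd.to_datetime(toy['unix'], unit='s', utc=True).dt.tz_convert(ZONE)\n",
    "\n",
    "# Series.between is inclusive at both ends by default.\n",
    "window = toy.loc[toy['time'].between(start, end)]\n",
    "print(list(window.index))\n",
    "```\n",
    "\n",
    "Only the rows labelled ``first`` and ``last`` fall inside the window, which matches the inclusive reading of the section heading."
   ]
  },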
  {
   "cell_type": "code",
   "execution_count": 129,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "3979 images for Rollright Stones\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>unix</th>\n",
       "      <th>caption</th>\n",
       "      <th>ownerid</th>\n",
       "      <th>likes</th>\n",
       "      <th>height</th>\n",
       "      <th>width</th>\n",
       "      <th>time</th>\n",
       "      <th>path</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>code</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>nyDOgMC6XW</th>\n",
       "      <td>1399651297</td>\n",
       "      <td>#cotswolds #manorcottages #littlebarn #ascott ...</td>\n",
       "      <td>721060744</td>\n",
       "      <td>26</td>\n",
       "      <td>640</td>\n",
       "      <td>640</td>\n",
       "      <td>2014-05-09 17:01:37+01:00</td>\n",
       "      <td>C:\\Users\\tania\\Documents\\SDS\\Thesis\\Pipeline\\A...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>oISna_quUH</th>\n",
       "      <td>1400397563</td>\n",
       "      <td>The kings men #rollrightstones</td>\n",
       "      <td>318487744</td>\n",
       "      <td>0</td>\n",
       "      <td>640</td>\n",
       "      <td>640</td>\n",
       "      <td>2014-05-18 08:19:23+01:00</td>\n",
       "      <td>C:\\Users\\tania\\Documents\\SDS\\Thesis\\Pipeline\\A...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>oISuxnquUP</th>\n",
       "      <td>1400397623</td>\n",
       "      <td>The whispering knights #rollrightstones</td>\n",
       "      <td>318487744</td>\n",
       "      <td>1</td>\n",
       "      <td>640</td>\n",
       "      <td>640</td>\n",
       "      <td>2014-05-18 08:20:23+01:00</td>\n",
       "      <td>C:\\Users\\tania\\Documents\\SDS\\Thesis\\Pipeline\\A...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>oIS3K2KuUZ</th>\n",
       "      <td>1400397692</td>\n",
       "      <td>The King stone #rollrightstones</td>\n",
       "      <td>318487744</td>\n",
       "      <td>0</td>\n",
       "      <td>640</td>\n",
       "      <td>640</td>\n",
       "      <td>2014-05-18 08:21:32+01:00</td>\n",
       "      <td>C:\\Users\\tania\\Documents\\SDS\\Thesis\\Pipeline\\A...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>oJrqV1rXZx</th>\n",
       "      <td>1400444248</td>\n",
       "      <td>#rollrightstones #england @vazzninja</td>\n",
       "      <td>282182908</td>\n",
       "      <td>2</td>\n",
       "      <td>640</td>\n",
       "      <td>640</td>\n",
       "      <td>2014-05-18 21:17:28+01:00</td>\n",
       "      <td>C:\\Users\\tania\\Documents\\SDS\\Thesis\\Pipeline\\A...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                  unix                                            caption  \\\n",
       "code                                                                        \n",
       "nyDOgMC6XW  1399651297  #cotswolds #manorcottages #littlebarn #ascott ...   \n",
       "oISna_quUH  1400397563                     The kings men #rollrightstones   \n",
       "oISuxnquUP  1400397623            The whispering knights #rollrightstones   \n",
       "oIS3K2KuUZ  1400397692                    The King stone #rollrightstones   \n",
       "oJrqV1rXZx  1400444248               #rollrightstones #england @vazzninja   \n",
       "\n",
       "              ownerid  likes  height  width                      time  \\\n",
       "code                                                                    \n",
       "nyDOgMC6XW  721060744     26     640    640 2014-05-09 17:01:37+01:00   \n",
       "oISna_quUH  318487744      0     640    640 2014-05-18 08:19:23+01:00   \n",
       "oISuxnquUP  318487744      1     640    640 2014-05-18 08:20:23+01:00   \n",
       "oIS3K2KuUZ  318487744      0     640    640 2014-05-18 08:21:32+01:00   \n",
       "oJrqV1rXZx  282182908      2     640    640 2014-05-18 21:17:28+01:00   \n",
       "\n",
       "                                                         path  \n",
       "code                                                           \n",
       "nyDOgMC6XW  C:\\Users\\tania\\Documents\\SDS\\Thesis\\Pipeline\\A...  \n",
       "oISna_quUH  C:\\Users\\tania\\Documents\\SDS\\Thesis\\Pipeline\\A...  \n",
       "oISuxnquUP  C:\\Users\\tania\\Documents\\SDS\\Thesis\\Pipeline\\A...  \n",
       "oIS3K2KuUZ  C:\\Users\\tania\\Documents\\SDS\\Thesis\\Pipeline\\A...  \n",
       "oJrqV1rXZx  C:\\Users\\tania\\Documents\\SDS\\Thesis\\Pipeline\\A...  "
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "7783 images for Rufford Abbey\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>unix</th>\n",
       "      <th>caption</th>\n",
       "      <th>ownerid</th>\n",
       "      <th>likes</th>\n",
       "      <th>height</th>\n",
       "      <th>width</th>\n",
       "      <th>time</th>\n",
       "      <th>path</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>code</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>nbp290QjcS</th>\n",
       "      <td>1398899799</td>\n",
       "      <td>A wildflower growing amongst the bluebells at ...</td>\n",
       "      <td>185294236</td>\n",
       "      <td>159</td>\n",
       "      <td>640</td>\n",
       "      <td>640</td>\n",
       "      <td>2014-05-01 00:16:39+01:00</td>\n",
       "      <td>C:\\Users\\tania\\Documents\\SDS\\Thesis\\Pipeline\\A...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>njM8zwQjR3</th>\n",
       "      <td>1399153078</td>\n",
       "      <td>Shot of the lake at Rufford Abbey and country ...</td>\n",
       "      <td>185294236</td>\n",
       "      <td>160</td>\n",
       "      <td>640</td>\n",
       "      <td>640</td>\n",
       "      <td>2014-05-03 22:37:58+01:00</td>\n",
       "      <td>C:\\Users\\tania\\Documents\\SDS\\Thesis\\Pipeline\\A...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>nkWPvLItcu</th>\n",
       "      <td>1399191506</td>\n",
       "      <td>#Sundial #RuffordAbbey</td>\n",
       "      <td>187637228</td>\n",
       "      <td>1</td>\n",
       "      <td>640</td>\n",
       "      <td>640</td>\n",
       "      <td>2014-05-04 09:18:26+01:00</td>\n",
       "      <td>C:\\Users\\tania\\Documents\\SDS\\Thesis\\Pipeline\\A...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>nkWXbwItc6</th>\n",
       "      <td>1399191569</td>\n",
       "      <td>#gargoyle #RuffordAbbey</td>\n",
       "      <td>187637228</td>\n",
       "      <td>1</td>\n",
       "      <td>640</td>\n",
       "      <td>640</td>\n",
       "      <td>2014-05-04 09:19:29+01:00</td>\n",
       "      <td>C:\\Users\\tania\\Documents\\SDS\\Thesis\\Pipeline\\A...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>nkWv9eItdT</th>\n",
       "      <td>1399191770</td>\n",
       "      <td>#Bluebells #RuffordAbbey</td>\n",
       "      <td>187637228</td>\n",
       "      <td>0</td>\n",
       "      <td>640</td>\n",
       "      <td>640</td>\n",
       "      <td>2014-05-04 09:22:50+01:00</td>\n",
       "      <td>C:\\Users\\tania\\Documents\\SDS\\Thesis\\Pipeline\\A...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                  unix                                            caption  \\\n",
       "code                                                                        \n",
       "nbp290QjcS  1398899799  A wildflower growing amongst the bluebells at ...   \n",
       "njM8zwQjR3  1399153078  Shot of the lake at Rufford Abbey and country ...   \n",
       "nkWPvLItcu  1399191506                             #Sundial #RuffordAbbey   \n",
       "nkWXbwItc6  1399191569                            #gargoyle #RuffordAbbey   \n",
       "nkWv9eItdT  1399191770                           #Bluebells #RuffordAbbey   \n",
       "\n",
       "              ownerid  likes  height  width                      time  \\\n",
       "code                                                                    \n",
       "nbp290QjcS  185294236    159     640    640 2014-05-01 00:16:39+01:00   \n",
       "njM8zwQjR3  185294236    160     640    640 2014-05-03 22:37:58+01:00   \n",
       "nkWPvLItcu  187637228      1     640    640 2014-05-04 09:18:26+01:00   \n",
       "nkWXbwItc6  187637228      1     640    640 2014-05-04 09:19:29+01:00   \n",
       "nkWv9eItdT  187637228      0     640    640 2014-05-04 09:22:50+01:00   \n",
       "\n",
       "                                                         path  \n",
       "code                                                           \n",
       "nbp290QjcS  C:\\Users\\tania\\Documents\\SDS\\Thesis\\Pipeline\\A...  \n",
       "njM8zwQjR3  C:\\Users\\tania\\Documents\\SDS\\Thesis\\Pipeline\\A...  \n",
       "nkWPvLItcu  C:\\Users\\tania\\Documents\\SDS\\Thesis\\Pipeline\\A...  \n",
       "nkWXbwItc6  C:\\Users\\tania\\Documents\\SDS\\Thesis\\Pipeline\\A...  \n",
       "nkWv9eItdT  C:\\Users\\tania\\Documents\\SDS\\Thesis\\Pipeline\\A...  "
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "print(\"{} images for Rollright Stones\".format(len(rollright5_df)))\n",
    "display(rollright5_df.sort_values('time').head())\n",
    "print(\"{} images for Rufford Abbey\".format(len(rufford5_df)))\n",
    "display(rufford5_df.sort_values('time').head())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 135,
   "metadata": {},
   "outputs": [],
   "source": [
    "with open('rollright5_df.pickle', 'wb') as f:\n",
    "    pickle.dump(rollright5_df,f,pickle.HIGHEST_PROTOCOL)\n",
    "with open('rufford5_df.pickle', 'wb') as f:\n",
    "    pickle.dump(rufford5_df,f,pickle.HIGHEST_PROTOCOL)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Randomly sample 3,979 posts (for Rufford Abbey)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 136,
   "metadata": {},
   "outputs": [],
   "source": [
    "raise RuntimeError('Guard: re-running this cell would draw a different random sample')\n",
    "rollright_sample = rollright5_df.copy()\n",
    "rufford_sample = rufford5_df.copy()\n",
    "rufford_sample = rufford_sample.sample(3979)"
   ]
  },
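  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Matching the larger site's post count to the smaller one's amounts to sampling without replacement. A minimal sketch on toy data follows; the frames ``small`` and ``large``, their sizes, and the ``random_state=0`` seed are assumptions for illustration and are not the values used to draw the samples in this notebook. Fixing a seed makes the draw repeatable, which is one alternative to guarding the cell against re-runs.\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "# Hypothetical stand-ins for the two date-restricted frames.\n",
    "small = pd.DataFrame({'likes': range(5)}, index=list('abcde'))\n",
    "large = pd.DataFrame({'likes': range(12)}, index=['p{}'.format(i) for i in range(12)])\n",
    "\n",
    "# Draw as many rows from the larger frame as the smaller one holds.\n",
    "# DataFrame.sample draws without replacement by default.\n",
    "matched = large.sample(n=len(small), random_state=0)\n",
    "again = large.sample(n=len(small), random_state=0)\n",
    "\n",
    "print(len(matched) == len(small))         # sizes now agree\n",
    "print(matched.index.is_unique)            # no post drawn twice\n",
    "print(matched.index.equals(again.index))  # same seed, same draw\n",
    "```"
   ]
  },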
  {
   "cell_type": "code",
   "execution_count": 137,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "3979\n",
      "3979\n"
     ]
    }
   ],
   "source": [
    "print(len(rollright_sample))\n",
    "print(len(rufford_sample))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 138,
   "metadata": {},
   "outputs": [],
   "source": [
    "raise RuntimeError('Guard: re-running this cell would overwrite the saved sample pickles')\n",
    "with open('rollright_sample.pickle', 'wb') as f:\n",
    "    pickle.dump(rollright_sample,f,pickle.HIGHEST_PROTOCOL)\n",
    "with open('rufford_sample.pickle', 'wb') as f:\n",
    "    pickle.dump(rufford_sample,f,pickle.HIGHEST_PROTOCOL)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "with open('rollright_sample.pickle', 'rb') as f:\n",
    "    rollright_sample = pickle.load(f)\n",
    "with open('rufford_sample.pickle', 'rb') as f:\n",
    "    rufford_sample = pickle.load(f)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Create new directories containing sample images:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 168,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Copied 500th image\n",
      "Copied 1000th image\n",
      "Copied 1500th image\n",
      "Copied 2000th image\n",
      "Copied 2500th image\n",
      "Copied 3000th image\n",
      "Copied 3500th image\n",
      "Copied 3979th image\n"
     ]
    }
   ],
   "source": [
    "rollright_sample_dir = os.path.join(CWD,\"rollright_sample\")\n",
    "os.makedirs(rollright_sample_dir, exist_ok=True)  # no error if the directory already exists\n",
    "for i, path in enumerate(rollright_sample['path']):\n",
    "    shutil.copy2(path, rollright_sample_dir)\n",
    "    if (i+1)%500==0 or (i+1)==len(rollright_sample['path']):\n",
    "        print('Copied {}th image'.format(i+1))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 169,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Copied 500th image\n",
      "Copied 1000th image\n",
      "Copied 1500th image\n",
      "Copied 2000th image\n",
      "Copied 2500th image\n",
      "Copied 3000th image\n",
      "Copied 3500th image\n",
      "Copied 3979th image\n"
     ]
    }
   ],
   "source": [
    "rufford_sample_dir = os.path.join(CWD,\"rufford_sample\")\n",
    "os.makedirs(rufford_sample_dir, exist_ok=True)  # no error if the directory already exists\n",
    "for i, path in enumerate(rufford_sample['path']):\n",
    "    shutil.copy2(path, rufford_sample_dir)\n",
    "    if (i+1)%500==0 or (i+1)==len(rufford_sample['path']):\n",
    "        print('Copied {}th image'.format(i+1))"
   ]
  },
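  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The two copy loops above differ only in their source paths and destination directory, so they could share a helper. The sketch below is one way to write it; ``copy_images`` and its ``report_every`` parameter are hypothetical names, and the demonstration uses empty temporary files rather than the scraped images.\n",
    "\n",
    "```python\n",
    "import os\n",
    "import shutil\n",
    "import tempfile\n",
    "\n",
    "\n",
    "def copy_images(paths, dest_dir, report_every=500):\n",
    "    # Copy each file in `paths` into `dest_dir`, creating the directory if needed.\n",
    "    os.makedirs(dest_dir, exist_ok=True)\n",
    "    paths = list(paths)\n",
    "    for i, path in enumerate(paths, start=1):\n",
    "        shutil.copy2(path, dest_dir)\n",
    "        # Report progress periodically and once more at the end.\n",
    "        if i % report_every == 0 or i == len(paths):\n",
    "            print('Copied {} of {} images'.format(i, len(paths)))\n",
    "    return len(paths)\n",
    "\n",
    "\n",
    "# Demonstration with three empty placeholder files.\n",
    "src_dir = tempfile.mkdtemp()\n",
    "dest_dir = os.path.join(tempfile.mkdtemp(), 'toy_sample')\n",
    "toy_paths = []\n",
    "for name in ['a.jpg', 'b.jpg', 'c.jpg']:\n",
    "    toy_path = os.path.join(src_dir, name)\n",
    "    open(toy_path, 'wb').close()\n",
    "    toy_paths.append(toy_path)\n",
    "\n",
    "n_copied = copy_images(toy_paths, dest_dir, report_every=2)\n",
    "print(sorted(os.listdir(dest_dir)))\n",
    "```\n",
    "\n",
    "With such a helper, the Rollright Stones cell would reduce to a single call along the lines of ``copy_images(rollright_sample['path'], rollright_sample_dir)``."
   ]
  },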
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Randomly select 1,000 posts to be annotated"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 154,
   "metadata": {},
   "outputs": [],
   "source": [
    "raise RuntimeError('Guard: re-running this cell would select a different set of posts to annotate')\n",
    "rollright_annot = rollright_sample.copy()\n",
    "rufford_annot = rufford_sample.copy()\n",
    "rollright_annot = rollright_annot.sample(1000)\n",
    "rufford_annot = rufford_annot.sample(1000)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 157,
   "metadata": {},
   "outputs": [],
   "source": [
    "raise RuntimeError('Guard: re-running this cell would overwrite the saved annotation pickles')\n",
    "with open('rollright_annot.pickle', 'wb') as f:\n",
    "    pickle.dump(rollright_annot,f,pickle.HIGHEST_PROTOCOL)\n",
    "with open('rufford_annot.pickle', 'wb') as f:\n",
    "    pickle.dump(rufford_annot,f,pickle.HIGHEST_PROTOCOL)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Create a new column ``annot`` in ``rollright_sample`` and ``rufford_sample`` indicating whether each post is to be annotated (``True``) or not (``False``)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 162,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Vectorised membership test: True where the post's image path is among the selected posts\n",
    "rollright_sample['annot'] = rollright_sample['path'].isin(rollright_annot['path'])\n",
    "rufford_sample['annot'] = rufford_sample['path'].isin(rufford_annot['path'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 163,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>unix</th>\n",
       "      <th>caption</th>\n",
       "      <th>ownerid</th>\n",
       "      <th>likes</th>\n",
       "      <th>height</th>\n",
       "      <th>width</th>\n",
       "      <th>time</th>\n",
       "      <th>path</th>\n",
       "      <th>annot</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>code</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>Bw38_jcHR6Z</th>\n",
       "      <td>1556615661</td>\n",
       "      <td>Group shot: The Whispering Knights do their be...</td>\n",
       "      <td>7289076068</td>\n",
       "      <td>96</td>\n",
       "      <td>607</td>\n",
       "      <td>1080</td>\n",
       "      <td>2019-04-30 10:14:21+01:00</td>\n",
       "      <td>C:\\Users\\tania\\Documents\\SDS\\Thesis\\Pipeline\\A...</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Bw3oz1Eje3t</th>\n",
       "      <td>1556605331</td>\n",
       "      <td>The Kings Men Stone Circle .\\r\\n.\\r\\n. . .  #r...</td>\n",
       "      <td>6213262605</td>\n",
       "      <td>54</td>\n",
       "      <td>421</td>\n",
       "      <td>750</td>\n",
       "      <td>2019-04-30 07:22:11+01:00</td>\n",
       "      <td>C:\\Users\\tania\\Documents\\SDS\\Thesis\\Pipeline\\A...</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Bw2josylFVB</th>\n",
       "      <td>1556568813</td>\n",
       "      <td>No more peaceful a place to watch the sun set ...</td>\n",
       "      <td>9003677977</td>\n",
       "      <td>60</td>\n",
       "      <td>1080</td>\n",
       "      <td>1080</td>\n",
       "      <td>2019-04-29 21:13:33+01:00</td>\n",
       "      <td>C:\\Users\\tania\\Documents\\SDS\\Thesis\\Pipeline\\A...</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Bw0Fl27nd74</th>\n",
       "      <td>1556485952</td>\n",
       "      <td>Family fun day out #familyday #oxfordshire #to...</td>\n",
       "      <td>239238200</td>\n",
       "      <td>9</td>\n",
       "      <td>1080</td>\n",
       "      <td>1080</td>\n",
       "      <td>2019-04-28 22:12:32+01:00</td>\n",
       "      <td>C:\\Users\\tania\\Documents\\SDS\\Thesis\\Pipeline\\A...</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>BwyjnrKHbPm</th>\n",
       "      <td>1556434586</td>\n",
       "      <td>On the way home from Warwick we went to see th...</td>\n",
       "      <td>5365410561</td>\n",
       "      <td>36</td>\n",
       "      <td>1080</td>\n",
       "      <td>1080</td>\n",
       "      <td>2019-04-28 07:56:26+01:00</td>\n",
       "      <td>C:\\Users\\tania\\Documents\\SDS\\Thesis\\Pipeline\\A...</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                   unix                                            caption  \\\n",
       "code                                                                         \n",
       "Bw38_jcHR6Z  1556615661  Group shot: The Whispering Knights do their be...   \n",
       "Bw3oz1Eje3t  1556605331  The Kings Men Stone Circle .\\r\\n.\\r\\n. . .  #r...   \n",
       "Bw2josylFVB  1556568813  No more peaceful a place to watch the sun set ...   \n",
       "Bw0Fl27nd74  1556485952  Family fun day out #familyday #oxfordshire #to...   \n",
       "BwyjnrKHbPm  1556434586  On the way home from Warwick we went to see th...   \n",
       "\n",
       "                ownerid  likes  height  width                      time  \\\n",
       "code                                                                      \n",
       "Bw38_jcHR6Z  7289076068     96     607   1080 2019-04-30 10:14:21+01:00   \n",
       "Bw3oz1Eje3t  6213262605     54     421    750 2019-04-30 07:22:11+01:00   \n",
       "Bw2josylFVB  9003677977     60    1080   1080 2019-04-29 21:13:33+01:00   \n",
       "Bw0Fl27nd74   239238200      9    1080   1080 2019-04-28 22:12:32+01:00   \n",
       "BwyjnrKHbPm  5365410561     36    1080   1080 2019-04-28 07:56:26+01:00   \n",
       "\n",
       "                                                          path  annot  \n",
       "code                                                                   \n",
       "Bw38_jcHR6Z  C:\\Users\\tania\\Documents\\SDS\\Thesis\\Pipeline\\A...   True  \n",
       "Bw3oz1Eje3t  C:\\Users\\tania\\Documents\\SDS\\Thesis\\Pipeline\\A...  False  \n",
       "Bw2josylFVB  C:\\Users\\tania\\Documents\\SDS\\Thesis\\Pipeline\\A...   True  \n",
       "Bw0Fl27nd74  C:\\Users\\tania\\Documents\\SDS\\Thesis\\Pipeline\\A...  False  \n",
       "BwyjnrKHbPm  C:\\Users\\tania\\Documents\\SDS\\Thesis\\Pipeline\\A...  False  "
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>unix</th>\n",
       "      <th>caption</th>\n",
       "      <th>ownerid</th>\n",
       "      <th>likes</th>\n",
       "      <th>height</th>\n",
       "      <th>width</th>\n",
       "      <th>time</th>\n",
       "      <th>path</th>\n",
       "      <th>annot</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>code</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>Bfn879DHYgY</th>\n",
       "      <td>1519571539</td>\n",
       "      <td>#robin #ruffordabbey #wildlifephotography #bir...</td>\n",
       "      <td>179728608</td>\n",
       "      <td>19</td>\n",
       "      <td>810</td>\n",
       "      <td>1080</td>\n",
       "      <td>2018-02-25 15:12:19+00:00</td>\n",
       "      <td>C:\\Users\\tania\\Documents\\SDS\\Thesis\\Pipeline\\A...</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>BrLlDfdn2tw</th>\n",
       "      <td>1544389298</td>\n",
       "      <td>our third Christmas 🖤</td>\n",
       "      <td>2954770963</td>\n",
       "      <td>91</td>\n",
       "      <td>809</td>\n",
       "      <td>1080</td>\n",
       "      <td>2018-12-09 21:01:38+00:00</td>\n",
       "      <td>C:\\Users\\tania\\Documents\\SDS\\Thesis\\Pipeline\\A...</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>BSffC3kFcfH</th>\n",
       "      <td>1491370144</td>\n",
       "      <td>#redbrick #orangery #ruffordabbey #nottinghams...</td>\n",
       "      <td>4062868979</td>\n",
       "      <td>108</td>\n",
       "      <td>1080</td>\n",
       "      <td>1080</td>\n",
       "      <td>2017-04-05 06:29:04+01:00</td>\n",
       "      <td>C:\\Users\\tania\\Documents\\SDS\\Thesis\\Pipeline\\A...</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>BflcvA3A6iu</th>\n",
       "      <td>1519487547</td>\n",
       "      <td>🌤 Getting out and about before the big chill h...</td>\n",
       "      <td>5605659</td>\n",
       "      <td>20</td>\n",
       "      <td>1080</td>\n",
       "      <td>1080</td>\n",
       "      <td>2018-02-24 15:52:27+00:00</td>\n",
       "      <td>C:\\Users\\tania\\Documents\\SDS\\Thesis\\Pipeline\\A...</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>BWUrd9Fld8a</th>\n",
       "      <td>1499597493</td>\n",
       "      <td>My Beautiful Family ❤️❤️</td>\n",
       "      <td>15339662</td>\n",
       "      <td>22</td>\n",
       "      <td>1080</td>\n",
       "      <td>1080</td>\n",
       "      <td>2017-07-09 11:51:33+01:00</td>\n",
       "      <td>C:\\Users\\tania\\Documents\\SDS\\Thesis\\Pipeline\\A...</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                   unix                                            caption  \\\n",
       "code                                                                         \n",
       "Bfn879DHYgY  1519571539  #robin #ruffordabbey #wildlifephotography #bir...   \n",
       "BrLlDfdn2tw  1544389298                              our third Christmas 🖤   \n",
       "BSffC3kFcfH  1491370144  #redbrick #orangery #ruffordabbey #nottinghams...   \n",
       "BflcvA3A6iu  1519487547  🌤 Getting out and about before the big chill h...   \n",
       "BWUrd9Fld8a  1499597493                           My Beautiful Family ❤️❤️   \n",
       "\n",
       "                ownerid  likes  height  width                      time  \\\n",
       "code                                                                      \n",
       "Bfn879DHYgY   179728608     19     810   1080 2018-02-25 15:12:19+00:00   \n",
       "BrLlDfdn2tw  2954770963     91     809   1080 2018-12-09 21:01:38+00:00   \n",
       "BSffC3kFcfH  4062868979    108    1080   1080 2017-04-05 06:29:04+01:00   \n",
       "BflcvA3A6iu     5605659     20    1080   1080 2018-02-24 15:52:27+00:00   \n",
       "BWUrd9Fld8a    15339662     22    1080   1080 2017-07-09 11:51:33+01:00   \n",
       "\n",
       "                                                          path  annot  \n",
       "code                                                                   \n",
       "Bfn879DHYgY  C:\\Users\\tania\\Documents\\SDS\\Thesis\\Pipeline\\A...  False  \n",
       "BrLlDfdn2tw  C:\\Users\\tania\\Documents\\SDS\\Thesis\\Pipeline\\A...  False  \n",
       "BSffC3kFcfH  C:\\Users\\tania\\Documents\\SDS\\Thesis\\Pipeline\\A...  False  \n",
       "BflcvA3A6iu  C:\\Users\\tania\\Documents\\SDS\\Thesis\\Pipeline\\A...  False  \n",
       "BWUrd9Fld8a  C:\\Users\\tania\\Documents\\SDS\\Thesis\\Pipeline\\A...   True  "
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "display(rollright_sample.head())\n",
    "display(rufford_sample.head())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Check that exactly 1,000 images per site are flagged for annotation (``annot`` is ``True``)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 164,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1000\n",
      "1000\n"
     ]
    }
   ],
   "source": [
    "print(rollright_sample['annot'].sum())\n",
    "print(rufford_sample['annot'].sum())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 286,
   "metadata": {},
   "outputs": [],
   "source": [
    "1/0 # to avoid re-running cell\n",
    "with open('rollright_sample.pickle', 'wb') as f:\n",
    "    pickle.dump(rollright_sample,f,pickle.HIGHEST_PROTOCOL)\n",
    "with open('rufford_sample.pickle', 'wb') as f:\n",
    "    pickle.dump(rufford_sample,f,pickle.HIGHEST_PROTOCOL)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Randomly distribute the selected 1,000 posts between four coders"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For each site, randomly generate a list of 1,000 ``coder_1`` IDs in which each of the four coder IDs appears 250 times, so that each coder first-codes 250 images per site."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 173,
   "metadata": {},
   "outputs": [],
   "source": [
    "1/0 # avoid re-running cell\n",
    "rollright_coders = [1,2,3,4]*250\n",
    "random.shuffle(rollright_coders)\n",
    "rufford_coders = [1,2,3,4]*250\n",
    "random.shuffle(rufford_coders)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 174,
   "metadata": {},
   "outputs": [],
   "source": [
    "1/0 # avoid re-running cell\n",
    "with open('rollright_coders.pickle', 'wb') as f:\n",
    "    pickle.dump(rollright_coders,f,pickle.HIGHEST_PROTOCOL)\n",
    "with open('rufford_coders.pickle', 'wb') as f:\n",
    "    pickle.dump(rufford_coders,f,pickle.HIGHEST_PROTOCOL)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 178,
   "metadata": {},
   "outputs": [],
   "source": [
    "rollright_annot['coder_1'] = rollright_coders\n",
    "rufford_annot['coder_1'] = rufford_coders"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 179,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>unix</th>\n",
       "      <th>caption</th>\n",
       "      <th>ownerid</th>\n",
       "      <th>likes</th>\n",
       "      <th>height</th>\n",
       "      <th>width</th>\n",
       "      <th>time</th>\n",
       "      <th>path</th>\n",
       "      <th>coder_1</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>code</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>BVnd0UUlTmv</th>\n",
       "      <td>1498080387</td>\n",
       "      <td>NaN</td>\n",
       "      <td>4153003979</td>\n",
       "      <td>1</td>\n",
       "      <td>608</td>\n",
       "      <td>1080</td>\n",
       "      <td>2017-06-21 22:26:27+01:00</td>\n",
       "      <td>C:\\Users\\tania\\Documents\\SDS\\Thesis\\Pipeline\\A...</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>BHKShUjD_Ep</th>\n",
       "      <td>1467036615</td>\n",
       "      <td>In ancient times...\\r\\nHundreds of years befor...</td>\n",
       "      <td>10344507</td>\n",
       "      <td>8</td>\n",
       "      <td>611</td>\n",
       "      <td>1080</td>\n",
       "      <td>2016-06-27 15:10:15+01:00</td>\n",
       "      <td>C:\\Users\\tania\\Documents\\SDS\\Thesis\\Pipeline\\A...</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>BYfb69XBCFY</th>\n",
       "      <td>1504253408</td>\n",
       "      <td>Missed the sunrise but it was still pretty 😂🌅 ...</td>\n",
       "      <td>1025124696</td>\n",
       "      <td>85</td>\n",
       "      <td>1080</td>\n",
       "      <td>1080</td>\n",
       "      <td>2017-09-01 09:10:08+01:00</td>\n",
       "      <td>C:\\Users\\tania\\Documents\\SDS\\Thesis\\Pipeline\\A...</td>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>BdnZR_KBmaz</th>\n",
       "      <td>1515257878</td>\n",
       "      <td>Today’s trip was to the Rollright Stones. This...</td>\n",
       "      <td>5926038790</td>\n",
       "      <td>20</td>\n",
       "      <td>1080</td>\n",
       "      <td>1080</td>\n",
       "      <td>2018-01-06 16:57:58+00:00</td>\n",
       "      <td>C:\\Users\\tania\\Documents\\SDS\\Thesis\\Pipeline\\A...</td>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>BuQyW36nusS</th>\n",
       "      <td>1551006495</td>\n",
       "      <td>‘The ROLLRIGHT STONES: Fact vs Fantasy at our ...</td>\n",
       "      <td>7289076068</td>\n",
       "      <td>103</td>\n",
       "      <td>607</td>\n",
       "      <td>1080</td>\n",
       "      <td>2019-02-24 11:08:15+00:00</td>\n",
       "      <td>C:\\Users\\tania\\Documents\\SDS\\Thesis\\Pipeline\\A...</td>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                   unix                                            caption  \\\n",
       "code                                                                         \n",
       "BVnd0UUlTmv  1498080387                                                NaN   \n",
       "BHKShUjD_Ep  1467036615  In ancient times...\\r\\nHundreds of years befor...   \n",
       "BYfb69XBCFY  1504253408  Missed the sunrise but it was still pretty 😂🌅 ...   \n",
       "BdnZR_KBmaz  1515257878  Today’s trip was to the Rollright Stones. This...   \n",
       "BuQyW36nusS  1551006495  ‘The ROLLRIGHT STONES: Fact vs Fantasy at our ...   \n",
       "\n",
       "                ownerid  likes  height  width                      time  \\\n",
       "code                                                                      \n",
       "BVnd0UUlTmv  4153003979      1     608   1080 2017-06-21 22:26:27+01:00   \n",
       "BHKShUjD_Ep    10344507      8     611   1080 2016-06-27 15:10:15+01:00   \n",
       "BYfb69XBCFY  1025124696     85    1080   1080 2017-09-01 09:10:08+01:00   \n",
       "BdnZR_KBmaz  5926038790     20    1080   1080 2018-01-06 16:57:58+00:00   \n",
       "BuQyW36nusS  7289076068    103     607   1080 2019-02-24 11:08:15+00:00   \n",
       "\n",
       "                                                          path  coder_1  \n",
       "code                                                                     \n",
       "BVnd0UUlTmv  C:\\Users\\tania\\Documents\\SDS\\Thesis\\Pipeline\\A...        1  \n",
       "BHKShUjD_Ep  C:\\Users\\tania\\Documents\\SDS\\Thesis\\Pipeline\\A...        1  \n",
       "BYfb69XBCFY  C:\\Users\\tania\\Documents\\SDS\\Thesis\\Pipeline\\A...        4  \n",
       "BdnZR_KBmaz  C:\\Users\\tania\\Documents\\SDS\\Thesis\\Pipeline\\A...        3  \n",
       "BuQyW36nusS  C:\\Users\\tania\\Documents\\SDS\\Thesis\\Pipeline\\A...        3  "
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>unix</th>\n",
       "      <th>caption</th>\n",
       "      <th>ownerid</th>\n",
       "      <th>likes</th>\n",
       "      <th>height</th>\n",
       "      <th>width</th>\n",
       "      <th>time</th>\n",
       "      <th>path</th>\n",
       "      <th>coder_1</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>code</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>BhmLShznCPM</th>\n",
       "      <td>1523806922</td>\n",
       "      <td>I took so many pictures at Rufford yesterday d...</td>\n",
       "      <td>20251612</td>\n",
       "      <td>141</td>\n",
       "      <td>1080</td>\n",
       "      <td>1080</td>\n",
       "      <td>2018-04-15 16:42:02+01:00</td>\n",
       "      <td>C:\\Users\\tania\\Documents\\SDS\\Thesis\\Pipeline\\A...</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Bo9oO4ngzse</th>\n",
       "      <td>1539626234</td>\n",
       "      <td>Out ya foot in it!!\\r\\n#walknotts #walking #hi...</td>\n",
       "      <td>7300930727</td>\n",
       "      <td>9</td>\n",
       "      <td>1080</td>\n",
       "      <td>1080</td>\n",
       "      <td>2018-10-15 18:57:14+01:00</td>\n",
       "      <td>C:\\Users\\tania\\Documents\\SDS\\Thesis\\Pipeline\\A...</td>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>BWs6SMMlHb8</th>\n",
       "      <td>1500410568</td>\n",
       "      <td>#pathway #ruffordabbey #trees #treelinedpath #...</td>\n",
       "      <td>3308977107</td>\n",
       "      <td>12</td>\n",
       "      <td>1350</td>\n",
       "      <td>1080</td>\n",
       "      <td>2017-07-18 21:42:48+01:00</td>\n",
       "      <td>C:\\Users\\tania\\Documents\\SDS\\Thesis\\Pipeline\\A...</td>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>BYWvnvtlCyc</th>\n",
       "      <td>1503961747</td>\n",
       "      <td>Bank holiday done! Bring on a week's annual le...</td>\n",
       "      <td>2032802212</td>\n",
       "      <td>5</td>\n",
       "      <td>1349</td>\n",
       "      <td>1080</td>\n",
       "      <td>2017-08-29 00:09:07+01:00</td>\n",
       "      <td>C:\\Users\\tania\\Documents\\SDS\\Thesis\\Pipeline\\A...</td>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>BeyKB_FF76f</th>\n",
       "      <td>1517766465</td>\n",
       "      <td>Magpie looking as haunting as ever watching ov...</td>\n",
       "      <td>20251612</td>\n",
       "      <td>90</td>\n",
       "      <td>1080</td>\n",
       "      <td>1080</td>\n",
       "      <td>2018-02-04 17:47:45+00:00</td>\n",
       "      <td>C:\\Users\\tania\\Documents\\SDS\\Thesis\\Pipeline\\A...</td>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                   unix                                            caption  \\\n",
       "code                                                                         \n",
       "BhmLShznCPM  1523806922  I took so many pictures at Rufford yesterday d...   \n",
       "Bo9oO4ngzse  1539626234  Out ya foot in it!!\\r\\n#walknotts #walking #hi...   \n",
       "BWs6SMMlHb8  1500410568  #pathway #ruffordabbey #trees #treelinedpath #...   \n",
       "BYWvnvtlCyc  1503961747  Bank holiday done! Bring on a week's annual le...   \n",
       "BeyKB_FF76f  1517766465  Magpie looking as haunting as ever watching ov...   \n",
       "\n",
       "                ownerid  likes  height  width                      time  \\\n",
       "code                                                                      \n",
       "BhmLShznCPM    20251612    141    1080   1080 2018-04-15 16:42:02+01:00   \n",
       "Bo9oO4ngzse  7300930727      9    1080   1080 2018-10-15 18:57:14+01:00   \n",
       "BWs6SMMlHb8  3308977107     12    1350   1080 2017-07-18 21:42:48+01:00   \n",
       "BYWvnvtlCyc  2032802212      5    1349   1080 2017-08-29 00:09:07+01:00   \n",
       "BeyKB_FF76f    20251612     90    1080   1080 2018-02-04 17:47:45+00:00   \n",
       "\n",
       "                                                          path  coder_1  \n",
       "code                                                                     \n",
       "BhmLShznCPM  C:\\Users\\tania\\Documents\\SDS\\Thesis\\Pipeline\\A...        2  \n",
       "Bo9oO4ngzse  C:\\Users\\tania\\Documents\\SDS\\Thesis\\Pipeline\\A...        3  \n",
       "BWs6SMMlHb8  C:\\Users\\tania\\Documents\\SDS\\Thesis\\Pipeline\\A...        3  \n",
       "BYWvnvtlCyc  C:\\Users\\tania\\Documents\\SDS\\Thesis\\Pipeline\\A...        4  \n",
       "BeyKB_FF76f  C:\\Users\\tania\\Documents\\SDS\\Thesis\\Pipeline\\A...        3  "
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "display(rollright_annot.head())\n",
    "display(rufford_annot.head())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For each site, randomly select a ``coder_2`` for each image, ensuring that the ``coder_2`` ID differs from that image's ``coder_1`` ID."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 244,
   "metadata": {},
   "outputs": [],
   "source": [
    "rollright_coders_2 = [1,2,3,4]*255\n",
    "random.shuffle(rollright_coders_2)\n",
    "rufford_coders_2 = [1,2,3,4]*255\n",
    "random.shuffle(rufford_coders_2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 245,
   "metadata": {},
   "outputs": [],
   "source": [
    "def getCoder2(coder1,coder2ids):\n",
    "    \"\"\"\n",
    "    Input: coder1(int)-ID of coder_1\n",
    "           coder2ids(list)-pool of IDs still available for coder_2\n",
    "    Returns: coder2(int)-ID of coder_2, drawn at random from the IDs in\n",
    "             coder2ids that differ from coder1\n",
    "    Raises: IndexError if no ID in coder2ids differs from coder1\n",
    "    Note: coder2ids is not modified; the caller removes the chosen ID\n",
    "    \"\"\"\n",
    "    possible_coder2 = [y for y in coder2ids if y != coder1]\n",
    "    coder2 = random.choice(possible_coder2)\n",
    "    return coder2"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 246,
   "metadata": {},
   "outputs": [],
   "source": [
    "result = 0\n",
    "while result==0:\n",
    "    try:\n",
    "        pool = list(rollright_coders_2) # fresh copy of the pool for each attempt\n",
    "        rollright_coder2id = [] # store list of coder_2 IDs\n",
    "        for coder1 in list(rollright_annot['coder_1']):\n",
    "            coder2 = getCoder2(coder1,pool)\n",
    "            pool.remove(coder2)\n",
    "            rollright_coder2id.append(coder2)\n",
    "        result=1\n",
    "    except IndexError: # no ID different from coder1 left in pool; retry\n",
    "        pass"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 247,
   "metadata": {},
   "outputs": [],
   "source": [
    "result = 0\n",
    "while result==0:\n",
    "    try:\n",
    "        pool = list(rufford_coders_2) # fresh copy of the pool for each attempt\n",
    "        rufford_coder2id = [] # store list of coder_2 IDs\n",
    "        for coder1 in list(rufford_annot['coder_1']):\n",
    "            coder2 = getCoder2(coder1,pool)\n",
    "            pool.remove(coder2)\n",
    "            rufford_coder2id.append(coder2)\n",
    "        result=1\n",
    "    except IndexError: # no ID different from coder1 left in pool; retry\n",
    "        pass"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 252,
   "metadata": {},
   "outputs": [],
   "source": [
    "rollright_annot['coder_2'] = rollright_coder2id\n",
    "rufford_annot['coder_2'] = rufford_coder2id"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 253,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>unix</th>\n",
       "      <th>caption</th>\n",
       "      <th>ownerid</th>\n",
       "      <th>likes</th>\n",
       "      <th>height</th>\n",
       "      <th>width</th>\n",
       "      <th>time</th>\n",
       "      <th>path</th>\n",
       "      <th>coder_1</th>\n",
       "      <th>coder_2</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>code</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>BVnd0UUlTmv</th>\n",
       "      <td>1498080387</td>\n",
       "      <td>NaN</td>\n",
       "      <td>4153003979</td>\n",
       "      <td>1</td>\n",
       "      <td>608</td>\n",
       "      <td>1080</td>\n",
       "      <td>2017-06-21 22:26:27+01:00</td>\n",
       "      <td>C:\\Users\\tania\\Documents\\SDS\\Thesis\\Pipeline\\A...</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>BHKShUjD_Ep</th>\n",
       "      <td>1467036615</td>\n",
       "      <td>In ancient times...\\r\\nHundreds of years befor...</td>\n",
       "      <td>10344507</td>\n",
       "      <td>8</td>\n",
       "      <td>611</td>\n",
       "      <td>1080</td>\n",
       "      <td>2016-06-27 15:10:15+01:00</td>\n",
       "      <td>C:\\Users\\tania\\Documents\\SDS\\Thesis\\Pipeline\\A...</td>\n",
       "      <td>1</td>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>BYfb69XBCFY</th>\n",
       "      <td>1504253408</td>\n",
       "      <td>Missed the sunrise but it was still pretty 😂🌅 ...</td>\n",
       "      <td>1025124696</td>\n",
       "      <td>85</td>\n",
       "      <td>1080</td>\n",
       "      <td>1080</td>\n",
       "      <td>2017-09-01 09:10:08+01:00</td>\n",
       "      <td>C:\\Users\\tania\\Documents\\SDS\\Thesis\\Pipeline\\A...</td>\n",
       "      <td>4</td>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>BdnZR_KBmaz</th>\n",
       "      <td>1515257878</td>\n",
       "      <td>Today’s trip was to the Rollright Stones. This...</td>\n",
       "      <td>5926038790</td>\n",
       "      <td>20</td>\n",
       "      <td>1080</td>\n",
       "      <td>1080</td>\n",
       "      <td>2018-01-06 16:57:58+00:00</td>\n",
       "      <td>C:\\Users\\tania\\Documents\\SDS\\Thesis\\Pipeline\\A...</td>\n",
       "      <td>3</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>BuQyW36nusS</th>\n",
       "      <td>1551006495</td>\n",
       "      <td>‘The ROLLRIGHT STONES: Fact vs Fantasy at our ...</td>\n",
       "      <td>7289076068</td>\n",
       "      <td>103</td>\n",
       "      <td>607</td>\n",
       "      <td>1080</td>\n",
       "      <td>2019-02-24 11:08:15+00:00</td>\n",
       "      <td>C:\\Users\\tania\\Documents\\SDS\\Thesis\\Pipeline\\A...</td>\n",
       "      <td>3</td>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                   unix                                            caption  \\\n",
       "code                                                                         \n",
       "BVnd0UUlTmv  1498080387                                                NaN   \n",
       "BHKShUjD_Ep  1467036615  In ancient times...\\r\\nHundreds of years befor...   \n",
       "BYfb69XBCFY  1504253408  Missed the sunrise but it was still pretty 😂🌅 ...   \n",
       "BdnZR_KBmaz  1515257878  Today’s trip was to the Rollright Stones. This...   \n",
       "BuQyW36nusS  1551006495  ‘The ROLLRIGHT STONES: Fact vs Fantasy at our ...   \n",
       "\n",
       "                ownerid  likes  height  width                      time  \\\n",
       "code                                                                      \n",
       "BVnd0UUlTmv  4153003979      1     608   1080 2017-06-21 22:26:27+01:00   \n",
       "BHKShUjD_Ep    10344507      8     611   1080 2016-06-27 15:10:15+01:00   \n",
       "BYfb69XBCFY  1025124696     85    1080   1080 2017-09-01 09:10:08+01:00   \n",
       "BdnZR_KBmaz  5926038790     20    1080   1080 2018-01-06 16:57:58+00:00   \n",
       "BuQyW36nusS  7289076068    103     607   1080 2019-02-24 11:08:15+00:00   \n",
       "\n",
       "                                                          path  coder_1  \\\n",
       "code                                                                      \n",
       "BVnd0UUlTmv  C:\\Users\\tania\\Documents\\SDS\\Thesis\\Pipeline\\A...        1   \n",
       "BHKShUjD_Ep  C:\\Users\\tania\\Documents\\SDS\\Thesis\\Pipeline\\A...        1   \n",
       "BYfb69XBCFY  C:\\Users\\tania\\Documents\\SDS\\Thesis\\Pipeline\\A...        4   \n",
       "BdnZR_KBmaz  C:\\Users\\tania\\Documents\\SDS\\Thesis\\Pipeline\\A...        3   \n",
       "BuQyW36nusS  C:\\Users\\tania\\Documents\\SDS\\Thesis\\Pipeline\\A...        3   \n",
       "\n",
       "             coder_2  \n",
       "code                  \n",
       "BVnd0UUlTmv        2  \n",
       "BHKShUjD_Ep        4  \n",
       "BYfb69XBCFY        3  \n",
       "BdnZR_KBmaz        2  \n",
       "BuQyW36nusS        4  "
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>unix</th>\n",
       "      <th>caption</th>\n",
       "      <th>ownerid</th>\n",
       "      <th>likes</th>\n",
       "      <th>height</th>\n",
       "      <th>width</th>\n",
       "      <th>time</th>\n",
       "      <th>path</th>\n",
       "      <th>coder_1</th>\n",
       "      <th>coder_2</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>code</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>BhmLShznCPM</th>\n",
       "      <td>1523806922</td>\n",
       "      <td>I took so many pictures at Rufford yesterday d...</td>\n",
       "      <td>20251612</td>\n",
       "      <td>141</td>\n",
       "      <td>1080</td>\n",
       "      <td>1080</td>\n",
       "      <td>2018-04-15 16:42:02+01:00</td>\n",
       "      <td>C:\\Users\\tania\\Documents\\SDS\\Thesis\\Pipeline\\A...</td>\n",
       "      <td>2</td>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Bo9oO4ngzse</th>\n",
       "      <td>1539626234</td>\n",
       "      <td>Out ya foot in it!!\\r\\n#walknotts #walking #hi...</td>\n",
       "      <td>7300930727</td>\n",
       "      <td>9</td>\n",
       "      <td>1080</td>\n",
       "      <td>1080</td>\n",
       "      <td>2018-10-15 18:57:14+01:00</td>\n",
       "      <td>C:\\Users\\tania\\Documents\\SDS\\Thesis\\Pipeline\\A...</td>\n",
       "      <td>3</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>BWs6SMMlHb8</th>\n",
       "      <td>1500410568</td>\n",
       "      <td>#pathway #ruffordabbey #trees #treelinedpath #...</td>\n",
       "      <td>3308977107</td>\n",
       "      <td>12</td>\n",
       "      <td>1350</td>\n",
       "      <td>1080</td>\n",
       "      <td>2017-07-18 21:42:48+01:00</td>\n",
       "      <td>C:\\Users\\tania\\Documents\\SDS\\Thesis\\Pipeline\\A...</td>\n",
       "      <td>3</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>BYWvnvtlCyc</th>\n",
       "      <td>1503961747</td>\n",
       "      <td>Bank holiday done! Bring on a week's annual le...</td>\n",
       "      <td>2032802212</td>\n",
       "      <td>5</td>\n",
       "      <td>1349</td>\n",
       "      <td>1080</td>\n",
       "      <td>2017-08-29 00:09:07+01:00</td>\n",
       "      <td>C:\\Users\\tania\\Documents\\SDS\\Thesis\\Pipeline\\A...</td>\n",
       "      <td>4</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>BeyKB_FF76f</th>\n",
       "      <td>1517766465</td>\n",
       "      <td>Magpie looking as haunting as ever watching ov...</td>\n",
       "      <td>20251612</td>\n",
       "      <td>90</td>\n",
       "      <td>1080</td>\n",
       "      <td>1080</td>\n",
       "      <td>2018-02-04 17:47:45+00:00</td>\n",
       "      <td>C:\\Users\\tania\\Documents\\SDS\\Thesis\\Pipeline\\A...</td>\n",
       "      <td>3</td>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                   unix                                            caption  \\\n",
       "code                                                                         \n",
       "BhmLShznCPM  1523806922  I took so many pictures at Rufford yesterday d...   \n",
       "Bo9oO4ngzse  1539626234  Out ya foot in it!!\\r\\n#walknotts #walking #hi...   \n",
       "BWs6SMMlHb8  1500410568  #pathway #ruffordabbey #trees #treelinedpath #...   \n",
       "BYWvnvtlCyc  1503961747  Bank holiday done! Bring on a week's annual le...   \n",
       "BeyKB_FF76f  1517766465  Magpie looking as haunting as ever watching ov...   \n",
       "\n",
       "                ownerid  likes  height  width                      time  \\\n",
       "code                                                                      \n",
       "BhmLShznCPM    20251612    141    1080   1080 2018-04-15 16:42:02+01:00   \n",
       "Bo9oO4ngzse  7300930727      9    1080   1080 2018-10-15 18:57:14+01:00   \n",
       "BWs6SMMlHb8  3308977107     12    1350   1080 2017-07-18 21:42:48+01:00   \n",
       "BYWvnvtlCyc  2032802212      5    1349   1080 2017-08-29 00:09:07+01:00   \n",
       "BeyKB_FF76f    20251612     90    1080   1080 2018-02-04 17:47:45+00:00   \n",
       "\n",
       "                                                          path  coder_1  \\\n",
       "code                                                                      \n",
       "BhmLShznCPM  C:\\Users\\tania\\Documents\\SDS\\Thesis\\Pipeline\\A...        2   \n",
       "Bo9oO4ngzse  C:\\Users\\tania\\Documents\\SDS\\Thesis\\Pipeline\\A...        3   \n",
       "BWs6SMMlHb8  C:\\Users\\tania\\Documents\\SDS\\Thesis\\Pipeline\\A...        3   \n",
       "BYWvnvtlCyc  C:\\Users\\tania\\Documents\\SDS\\Thesis\\Pipeline\\A...        4   \n",
       "BeyKB_FF76f  C:\\Users\\tania\\Documents\\SDS\\Thesis\\Pipeline\\A...        3   \n",
       "\n",
       "             coder_2  \n",
       "code                  \n",
       "BhmLShznCPM        4  \n",
       "Bo9oO4ngzse        2  \n",
       "BWs6SMMlHb8        1  \n",
       "BYWvnvtlCyc        2  \n",
       "BeyKB_FF76f        4  "
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "display(rollright_annot.head())\n",
    "display(rufford_annot.head())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 254,
   "metadata": {},
   "outputs": [],
   "source": [
    "1/0 # avoid re-running cell\n",
    "with open('rollright_annot_coder.pickle', 'wb') as f:\n",
    "    pickle.dump(rollright_annot,f,pickle.HIGHEST_PROTOCOL)\n",
    "with open('rufford_annot_coder.pickle', 'wb') as f:\n",
    "    pickle.dump(rufford_annot,f,pickle.HIGHEST_PROTOCOL)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For each site, create new directories containing the images to be annotated, one directory per coding (first and second) and per coder (`1`,`2`,`3`,`4`):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 274,
   "metadata": {},
   "outputs": [],
   "source": [
    "def createCodingDirectories(sitename, annot_df):\n",
    "    \"\"\"\n",
    "    Copy images to be annotated into one directory per coding and coder.\n",
    "\n",
    "    Input: sitename (str) - name of site\n",
    "           annot_df (DataFrame) - DF of 1,000 images to be annotated, with\n",
    "               columns 'path', 'coder_1' and 'coder_2'\n",
    "    Returns: True if executed without error\n",
    "    \"\"\"\n",
    "    # Parent directory for site\n",
    "    annot_dir = os.path.join(CWD, \"{}_annot\".format(sitename))\n",
    "\n",
    "    # Child directories for first and second coding\n",
    "    annot_firstcoding_dir = os.path.join(annot_dir, \"first_coding\")\n",
    "    annot_secondcoding_dir = os.path.join(annot_dir, \"second_coding\")\n",
    "\n",
    "    # Grandchild directories for each coder for each coding\n",
    "    for i, coding_dir in enumerate([annot_firstcoding_dir, annot_secondcoding_dir]):\n",
    "        for coder in [1, 2, 3, 4]:\n",
    "            print('For coding {} for coder {}:'.format(i+1, coder))\n",
    "            coder_dir = os.path.join(coding_dir, str(coder))\n",
    "            # Create the coder directory, plus any missing parent directories\n",
    "            os.makedirs(coder_dir, exist_ok=True)\n",
    "            # Paths of images assigned to this coder for this coding\n",
    "            image_paths = list(annot_df.loc[annot_df['coder_{}'.format(i+1)] == coder, 'path'])\n",
    "            for j, path in enumerate(image_paths):\n",
    "                shutil.copy2(path, coder_dir)\n",
    "                # Report progress every 100 images and after the last image\n",
    "                if (j+1) % 100 == 0 or (j+1) == len(image_paths):\n",
    "                    print('\\tCopied {}th image'.format(j+1))\n",
    "    return True"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 275,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "For coding 1 for coder 1:\n",
      "\tCopied 100th image\n",
      "\tCopied 200th image\n",
      "\tCopied 250th image\n",
      "For coding 1 for coder 2:\n",
      "\tCopied 100th image\n",
      "\tCopied 200th image\n",
      "\tCopied 250th image\n",
      "For coding 1 for coder 3:\n",
      "\tCopied 100th image\n",
      "\tCopied 200th image\n",
      "\tCopied 250th image\n",
      "For coding 1 for coder 4:\n",
      "\tCopied 100th image\n",
      "\tCopied 200th image\n",
      "\tCopied 250th image\n",
      "For coding 2 for coder 1:\n",
      "\tCopied 100th image\n",
      "\tCopied 200th image\n",
      "\tCopied 246th image\n",
      "For coding 2 for coder 2:\n",
      "\tCopied 100th image\n",
      "\tCopied 200th image\n",
      "\tCopied 250th image\n",
      "For coding 2 for coder 3:\n",
      "\tCopied 100th image\n",
      "\tCopied 200th image\n",
      "\tCopied 250th image\n",
      "For coding 2 for coder 4:\n",
      "\tCopied 100th image\n",
      "\tCopied 200th image\n",
      "\tCopied 254th image\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 275,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "createCodingDirectories('rollright',rollright_annot)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 278,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "For coding 1 for coder 1:\n",
      "\tCopied 100th image\n",
      "\tCopied 200th image\n",
      "\tCopied 250th image\n",
      "For coding 1 for coder 2:\n",
      "\tCopied 100th image\n",
      "\tCopied 200th image\n",
      "\tCopied 250th image\n",
      "For coding 1 for coder 3:\n",
      "\tCopied 100th image\n",
      "\tCopied 200th image\n",
      "\tCopied 250th image\n",
      "For coding 1 for coder 4:\n",
      "\tCopied 100th image\n",
      "\tCopied 200th image\n",
      "\tCopied 250th image\n",
      "For coding 2 for coder 1:\n",
      "\tCopied 100th image\n",
      "\tCopied 200th image\n",
      "\tCopied 245th image\n",
      "For coding 2 for coder 2:\n",
      "\tCopied 100th image\n",
      "\tCopied 200th image\n",
      "\tCopied 251th image\n",
      "For coding 2 for coder 3:\n",
      "\tCopied 100th image\n",
      "\tCopied 200th image\n",
      "\tCopied 251th image\n",
      "For coding 2 for coder 4:\n",
      "\tCopied 100th image\n",
      "\tCopied 200th image\n",
      "\tCopied 253th image\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 278,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "createCodingDirectories('rufford',rufford_annot)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Export DataFrames to ``.xlsx``.\n",
    "\n",
    "_Note_: Excel doesn't support ``tzinfo``, so the timezone-aware column ``time`` is left out of the export and the ``unix`` column is exported instead."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 287,
   "metadata": {},
   "outputs": [],
   "source": [
    "1/0  # deliberately raise ZeroDivisionError to avoid re-running this cell and overwriting the exported .xlsx files\n",
    "rollright5_df[['unix','caption','ownerid','likes']].to_excel('rollright5.xlsx')\n",
    "rufford5_df[['unix','caption','ownerid','likes']].to_excel('rufford5.xlsx')\n",
    "rollright_sample[['unix','caption','ownerid','likes','annot']].to_excel('rollright_sample.xlsx')\n",
    "rufford_sample[['unix','caption','ownerid','likes','annot']].to_excel('rufford_sample.xlsx')\n",
    "rollright_annot[['unix','caption','ownerid','likes','coder_1','coder_2']].to_excel('rollright_annot.xlsx')\n",
    "rufford_annot[['unix','caption','ownerid','likes','coder_1','coder_2']].to_excel('rufford_annot.xlsx')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.8"
  },
  "toc-autonumbering": false
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
